Scaling Up Word Clustering
Abstract
Word clusters improve performance in many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3–85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm’s perplexity by up to 18%. Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, resulting in shorter training times achieving the same translation quality measured in BLEU scores. ClusterCat is portable and freely available.
Similar resources
Word clustering effect on vocabulary learning of EFL learners: A case of semantic versus phonological clustering
The aim of this study is to determine the effect of word clustering method on vocabulary learning of Iranian EFL learners through a case of semantic versus phonological clustering. To this effect, 80 homogeneous students from four intermediate classes at an English institute in Torbat e Heydariyeh participated in this research. They were assigned to four groups according to semantic versus phon...
A Synchronic Lexical Study of Gbe Language Varieties: The Effects of Different Similarity Judgment Criteria
In the context of a synchronic lexical study of the Gbe varieties of West Africa, this paper explores the question whether the use of different criteria sets to judge the similarity of lexical features in different language varieties yields the same or different conclusions regarding the relative relationships and clustering of the investigated varieties and the prioritization of further sociol...
Fuzzy Clustering Approach Using Data Fusion Theory and its Application To Automatic Isolated Word Recognition
In this paper, utilization of clustering algorithms for data fusion in decision level is proposed. The results of automatic isolated word recognition, which are derived from speech spectrograph and Linear Predictive Coding (LPC) analysis, are combined with each other by using fuzzy clustering algorithms, especially fuzzy k-means and fuzzy vector quantization. Experimental results show that the...
Offline Language-free Writer Identification based on Speeded-up Robust Features
This article proposes offline language-free writer identification based on speeded-up robust features (SURF), goes through training, enrollment, and identification stages. In all stages, an isotropic Box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of word region and the corresponding scales and orientations (SOs) are extr...
Development of Meaning Structure by Usage-based Word Relationships
Development of meaning structure is studied from a usage-based viewpoint by a constructive approach. The meaning structure is represented by relationships between words. A word's relationship to other words, which represents meanings of the word, is derived by analyzing similarity of the word's usage in sentences. Words make clusters according to their similarity. The word clusters are classifie...
Publication date: 2016